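The GPU Developer's Creed establishes a foundational philosophy that puts functional integrity and architectural decoupling ahead of raw throughput. In the ROCm ecosystem, because HIP exposes massive parallelism, we treat every kernel as a high-risk, fully isolated black box.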
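1. The Supremacy of Correctness
In HIP development, a "fast" result that is statistically inconsistent is a failure. We establish verifiable mathematical correctness across the whole ROCm stack first, and only then optimize low-level assembly or register pressure. Without accuracy, performance is meaningless.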
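As a minimal sketch of what "verifiable" can mean in practice, the helper below compares a candidate result against a CPU reference using a relative tolerance. The names `max_rel_error` and `matches_reference`, the `floor` parameter, and the choice of a relative rather than bit-exact comparison are assumptions made for illustration; the Creed does not prescribe them. The code is plain host-side C++, so it builds without ROCm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: largest relative error of `got` against `ref`.
// `floor` keeps the denominator away from zero for near-zero reference values.
double max_rel_error(const std::vector<double>& ref,
                     const std::vector<double>& got,
                     double floor = 1e-12) {
    assert(ref.size() == got.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double diff = std::fabs(got[i] - ref[i]);
        // A NaN on either side never counts as a match.
        if (std::isnan(diff)) return std::numeric_limits<double>::infinity();
        const double denom = std::max(std::fabs(ref[i]), floor);
        worst = std::max(worst, diff / denom);
    }
    return worst;
}

// True when every element of `got` is within `rel_tol` of the reference.
bool matches_reference(const std::vector<double>& ref,
                       const std::vector<double>& got,
                       double rel_tol) {
    return max_rel_error(ref, got) <= rel_tol;
}
```

A test harness might call `matches_reference(cpu_out, gpu_out, tol)` after copying the device output back to the host. The right tolerance depends on the kernel's arithmetic and is something each project has to choose.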
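2. Isolation as a Diagnostic Guardrail
By enforcing strict isolation between host-side management and device-side execution, with minimal global state and side effects, we turn non-deterministic concurrency bugs into reproducible logical units.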
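3. Memory/Concurrency Fatalism
We accept that memory corruption and race conditions are the primary "predators" of GPU performance. Because HIP is the primary low-level programming interface, the Creed requires every new kernel to start from a baseline of conservative synchronization and explicit memory ownership.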
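The "explicit memory ownership" half of this rule can be sketched as a move-only owner type: exactly one object is responsible for each allocation, and copying is a compile-time error. `OwnedBuffer` is a hypothetical name, and `std::malloc`/`std::free` stand in for the device allocator so the sketch builds without ROCm; a HIP build would presumably call `hipMalloc` and `hipFree` at those points and check the returned error code. The sketch does not cover the synchronization half of the rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical owner type for one buffer of floats.
// ASSUMPTION: std::malloc/std::free are stand-ins for a device allocator.
class OwnedBuffer {
public:
    explicit OwnedBuffer(std::size_t count)
        : data_(static_cast<float*>(std::malloc(count * sizeof(float)))),
          count_(data_ ? count : 0) {}

    ~OwnedBuffer() { std::free(data_); }

    // Copying is forbidden: two owners of one allocation invite double frees.
    OwnedBuffer(const OwnedBuffer&) = delete;
    OwnedBuffer& operator=(const OwnedBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    OwnedBuffer(OwnedBuffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}

    OwnedBuffer& operator=(OwnedBuffer&& other) noexcept {
        if (this != &other) {
            std::free(data_);
            data_ = std::exchange(other.data_, nullptr);
            count_ = std::exchange(other.count_, 0);
        }
        return *this;
    }

    float* data() const { return data_; }
    std::size_t size() const { return count_; }

private:
    float* data_;        // declared first: count_ is initialized from it
    std::size_t count_;
};
```

With a type like this, a function signature states who owns what: taking `OwnedBuffer&&` accepts ownership, while taking `float*` plus a size only borrows the memory.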
QUESTION 1
According to the Creed, what is a statistically inconsistent 'fast' result considered?
An acceptable trade-off for real-time systems.
A failure.
A 'heuristic' optimization.
A driver-level anomaly.
✅ Correct!
Correctness is the foundation; a fast but wrong answer is useless in scientific and production computing.
❌ Incorrect
The Creed explicitly states that speed without verifiable correctness is a failure.
QUESTION 2
Why is 'Isolation' emphasized in the GPU development workflow?
To prevent the GPU from accessing host memory.
To reduce the electricity consumption of the ROCm stack.
To transform non-deterministic concurrency bugs into reproducible logical units.
To hide kernel source code from other developers.
✅ Correct!
Isolation allows you to debug specific units without the noise of global state or asynchronous race conditions.
❌ Incorrect
Isolation is a diagnostic strategy to make bugs reproducible.
QUESTION 3
In the 'Hierarchy of Needs' for GPU development, what forms the wide base?
Peak TFLOPS Tuning.
Functional Correctness (CPU Parity).
Shared Memory Optimization.
Inline Assembly.
✅ Correct!
CPU parity ensures the mathematical logic is sound before GPU-specific complexities are added.
❌ Incorrect
Check the pyramid visual: Functional Correctness is the widest, most critical layer.
QUESTION 4
What does 'Memory/Concurrency Fatalism' imply for a developer?
Assuming that memory will never fail.
Accepting that race conditions are the primary predators of performance.
Ignoring error codes from hipMalloc.
Assuming the compiler handles all synchronization.
✅ Correct!
Fatalism here means recognizing the inherent dangers of parallel memory access and planning for them from the start.
❌ Incorrect
Fatalism means assuming these errors WILL happen unless specifically prevented.
QUESTION 5
What is the recommended first step when implementing a complex kernel like an FFT?
Optimize shared memory usage immediately.
Use inline GCN assembly for speed.
Implement a strictly isolated version using global memory and explicit synchronization.
Disable all error checking to measure raw latency.
✅ Correct!
Verified global memory logic serves as the 'Gold Standard' before introducing complex shared memory tiling.
❌ Incorrect
Jumping to shared memory shuffles before verifying the logic violates the Creed's correctness-first rule.
Case Study: The 'Fast but Wrong' Wavefront
Debugging a 3D Stencil Kernel
A developer migrates a 3D Wavefront Reconstruction kernel to ROCm. To maximize speed, they use volatile shared memory and skip hipDeviceSynchronize() calls. The kernel runs 100x faster than the CPU version, but during high-load production runs 2% of the output values are slightly off.
Q
Based on the GPU Developer's Creed, what is the immediate priority for this developer?
Solution:
The priority is Functional Correctness. The developer should revert the optimizations (the volatile shared memory and the skipped synchronization) and implement a strictly isolated version using global memory and explicit synchronization, then use it to locate where the output diverges from the 'Golden Model'.
Q
Which layer of the Hierarchy of Needs did the developer skip?
Solution:
The developer skipped the base layer (Functional Correctness) and the middle layer (Isolation & Safety) to jump directly to the narrow tip (Performance Tuning).
Q
How does 'Isolation' help solve the 2% error rate in this scenario?
Solution:
By isolating the kernel, running it repeatedly on identical input, and comparing each run against a CPU reference (bit-for-bit where the arithmetic allows, otherwise within a stated tolerance), the developer can determine whether the error is a deterministic flaw in the math or a non-deterministic race condition caused by concurrent shared memory access.
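One way to sketch the triage described in this solution: collect the outputs of several runs of the isolated kernel on identical input, check first whether the runs agree with each other, and only then check them against the CPU reference. `Diagnosis`, `triage`, and `abs_tol` are hypothetical names, and the absolute tolerance is an assumption. Run-to-run differences suggest a race, though other order-dependent effects such as atomic floating-point accumulation can produce them too.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

enum class Diagnosis { Matches, LogicFlaw, NonDeterministic };

// Hypothetical triage helper. `runs` holds the outputs of repeated launches
// of the same kernel on identical input; `ref` is the CPU reference.
Diagnosis triage(const std::vector<float>& ref,
                 const std::vector<std::vector<float>>& runs,
                 float abs_tol) {
    assert(!runs.empty());
    const std::vector<float>& first = runs.front();

    // Step 1: runs that differ from each other bit-for-bit are not
    // reproducible, which points at a race or another ordering effect.
    for (const auto& run : runs) {
        assert(run.size() == ref.size());
        if (!run.empty() &&
            std::memcmp(run.data(), first.data(),
                        run.size() * sizeof(float)) != 0) {
            return Diagnosis::NonDeterministic;
        }
    }

    // Step 2: stable output that misses the reference points at the math.
    // The negated comparison also rejects NaN values.
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (!(std::fabs(first[i] - ref[i]) <= abs_tol)) {
            return Diagnosis::LogicFlaw;
        }
    }
    return Diagnosis::Matches;
}
```

A race that shows up only under load may need many runs, or runs under production-like contention, before step 1 catches it; a `Matches` result from a handful of runs is evidence, not proof.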